Student id: 1155228903
First, download orf_trans.fasta (S. cerevisiae S288C) from yeastgenome.org using wget.
xxxxxxxxxx$ mkdir -p ~/5030/hw2/q1$ cd ~/5030/hw2/q1$ wget http://sgd-archive.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans.fasta.gz
--2024-10-23 16:24:28-- http://sgd-archive.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans.fasta.gzResolving sgd-archive.yeastgenome.org (sgd-archive.yeastgenome.org)... 52.218.132.178, 52.218.242.114, 52.218.237.242, ...Connecting to sgd-archive.yeastgenome.org (sgd-archive.yeastgenome.org)|52.218.132.178|:80... connected.HTTP request sent, awaiting response... 200 OKLength: 2603046 (2.5M) [application/x-gzip]Saving to: ‘orf_trans.fasta.gz’
orf_trans.fasta.gz 100%[=================================================================>] 2.48M 304KB/s in 7.4s
2024-10-23 16:24:36 (345 KB/s) - ‘orf_trans.fasta.gz’ saved [2603046/2603046]
$ gunzip orf_trans.fasta.gzThen create a database of these sequences.
xxxxxxxxxx$ makeblastdb -in orf_trans.fasta -dbtype prot -out yeast_protein_db
Building a new DB, current time: 10/23/2024 16:24:49New DB name: /hdd1/home/f24_htczhu/5030/hw2/q1/yeast_protein_dbNew DB title: orf_trans.fastaSequence type: ProteinKeep MBits: TMaximum file size: 3000000000BAdding sequences from FASTA; added 6039 sequences in 0.16837 seconds.In order to find sequences that are similar to others, you need to blastp this queries against itself (using the database created in step2).
xxxxxxxxxx$ blastp -query orf_trans.fasta -db yeast_protein_db -out blastp_output.txt -max_target_seqs 5 -evalue 1e-5 -outfmt 6At the end, summarize the blast output and filter the proteins that are similar to the other in the yeast exome. Use wc command to show the number of proteins in your output file.
xxxxxxxxxx$ wc -l blastp_output.txt
14880 blastp_output.txtPrepare the working directory and .sh files:
xxxxxxxxxxFile structure: |--- data | |--- trimmed_fastq_small | |--- ref_genome | |--- results |--- sam |--- bam |--- bcf |--- vcfxxxxxxxxxx# Inital prepare$ mkdir -p ~/5030/hw2/q2$ cd ~/5030/hw2/q2$ [ -f run_mapping.sh ] || touch run_mapping.sh$ [ -f run_variant_calling.sh ] || touch run_variant_calling.sh$ chmod +x run_mapping.sh$ chmod +x run_variant_calling.sh
# Prepare testing file: ecoli_rel606.fasta ...$ mkdir -p data/trimmed_fastq_small$ cd data/trimmed_fastq_small$ for file in /hdd1/shared/5030-lec6/data/trimmed_fastq_small/*; do ln -s $file; done$ mkdir -p ~/5030/hw2/q2/data/ref_genome $ cd ~/5030/hw2/q2/data/ref_genome$ ln -s /hdd1/shared/5030-lec6/data/ref_genome/ecoli_rel606.fasta ecoli_rel606.fasta$ cd ~/5030/hw2/q2$ mkdir -p results/sam results/bam results/bcf results/vcfContent of the file run_mapping.sh (download):
# Inputs: # paired_end_1 - the path to the first paired-end read file# paired_end_2 - the path to the second paired-end read file# reference_genome - the path to the reference genome file# output_bam - the prefix path for the output bam files
paired_end_1=$1paired_end_2=$2reference_genome=$3output_bam=$4 # only <prefix> !!!
# Index the reference genome for use by BWAbwa index $reference_genome
# Align the reads to the reference genomebwa mem $reference_genome $paired_end_1 $paired_end_2 > ${output_bam}.aligned.sam
# Convert the SAM file to BAM formatsamtools view -S -b ${output_bam}.aligned.sam > ${output_bam}.aligned.bam
# Sort and index the BAM filesamtools sort -o ${output_bam}.aligned.sorted.bam ${output_bam}.aligned.bamsamtools index ${output_bam}.aligned.sorted.bam
echo "Mapping is complete."Content of the file run_variant_calling.sh (download):
# Inputs:# sorted_bam_file - the path to the sorted and indexed BAM file# reference_genome - the path to the reference genome file# output_vcf - the prefix path for the output VCF file
sorted_bam_file=$1reference_genome=$2output_vcf=$3 # only <prefix> !!!
# Calculate the coverage using mpileupbcftools mpileup -O b -o ${output_vcf}_raw.bcf -f $reference_genome $sorted_bam_file
# Call variants using bcftoolsbcftools call --ploidy 1 -m -v -o ${output_vcf}_variants.vcf ${output_vcf}_raw.bcf
# Filter and report the SNPsvcfutils.pl varFilter ${output_vcf}_variants.vcf > ${output_vcf}_final_variants.vcf
# Optionally, visualize the aligned reads (use space key to move around)samtools tview $sorted_bam_file $reference_genome
echo "Variant calling is complete."Run these scripts (example):
$ bash run_mapping.sh \ data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq \ data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq \ data/ref_genome/ecoli_rel606.fasta \ results/bam/SRR2584866 [bwa_index] Pack FASTA... 0.04 sec[bwa_index] Construct BWT for the packed sequence...[bwa_index] 1.04 seconds elapse.[bwa_index] Update BWT... 0.02 sec[bwa_index] Pack forward-only FASTA... 0.02 sec[bwa_index] Construct SA from BWT and Occ... 0.33 sec[main] Version: 0.7.18-r1243-dirty[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta[main] Real time: 1.441 sec; CPU: 1.439 sec[M::bwa_idx_load_from_disk] read 0 ALT contigs[M::process] read 77446 sequences (10000033 bp)...[M::process] read 77296 sequences (10000182 bp)...[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (48, 36728, 21, 61)[M::mem_pestat] analyzing insert size distribution for orientation FF...[M::mem_pestat] (25, 50, 75) percentile: (420, 660, 1774)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4482)[M::mem_pestat] mean and std.dev: (784.68, 700.87)[M::mem_pestat] low and high boundaries for proper pairs: (1, 5836)[M::mem_pestat] analyzing insert size distribution for orientation FR...[M::mem_pestat] (25, 50, 75) percentile: (221, 361, 576)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1286)[M::mem_pestat] mean and std.dev: (412.89, 227.06)[M::mem_pestat] low and high boundaries for proper pairs: (1, 1641)[M::mem_pestat] analyzing insert size distribution for orientation RF...[M::mem_pestat] (25, 50, 75) percentile: (560, 2011, 2594)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 6662)[M::mem_pestat] mean and std.dev: (1580.30, 978.54)[M::mem_pestat] low and high boundaries for proper pairs: (1, 8696)[M::mem_pestat] analyzing insert size distribution for orientation RR...[M::mem_pestat] (25, 50, 75) percentile: (320, 549, 942)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2186)[M::mem_pestat] mean and std.dev: (581.31, 431.43)[M::mem_pestat] low and high boundaries for proper pairs: (1, 2808)[M::mem_pestat] skip orientation FF[M::mem_pestat] skip orientation RF[M::mem_pestat] skip orientation RR[M::mem_process_seqs] Processed 77446 reads in 2.649 CPU sec, 2.598 real sec[M::process] read 75458 sequences (10000048 bp)...[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (65, 36742, 10, 60)[M::mem_pestat] analyzing insert size distribution for orientation FF...[M::mem_pestat] (25, 50, 75) percentile: (172, 452, 1166)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3154)[M::mem_pestat] mean and std.dev: (645.66, 709.63)[M::mem_pestat] low and high boundaries for proper pairs: (1, 4148)[M::mem_pestat] analyzing insert size distribution for orientation FR...[M::mem_pestat] (25, 50, 75) percentile: (224, 368, 582)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1298)[M::mem_pestat] mean and std.dev: (417.93, 229.24)[M::mem_pestat] low and high boundaries for proper pairs: (1, 1656)[M::mem_pestat] analyzing insert size distribution for orientation RF...[M::mem_pestat] (25, 50, 75) percentile: (2317, 2353, 2422)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (2107, 2632)[M::mem_pestat] mean and std.dev: (2319.12, 75.27)[M::mem_pestat] low and high boundaries for proper pairs: (2002, 2737)[M::mem_pestat] analyzing insert size distribution for orientation RR...[M::mem_pestat] (25, 50, 75) percentile: (244, 513, 1366)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3610)[M::mem_pestat] mean and std.dev: (781.13, 858.61)[M::mem_pestat] low and high boundaries for proper pairs: (1, 4732)[M::mem_pestat] skip orientation FF[M::mem_pestat] skip orientation RF[M::mem_pestat] skip orientation RR[M::mem_process_seqs] Processed 77296 reads in 2.362 CPU sec, 2.268 real sec[M::process] read 74054 sequences (10000155 bp)...[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (58, 35818, 13, 79)[M::mem_pestat] analyzing insert size distribution for orientation FF...[M::mem_pestat] (25, 50, 75) percentile: (288, 553, 1479)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3861)[M::mem_pestat] mean and std.dev: (822.66, 856.80)[M::mem_pestat] low and high boundaries for proper pairs: (1, 5052)[M::mem_pestat] analyzing insert size distribution for orientation FR...[M::mem_pestat] (25, 50, 75) percentile: (224, 361, 576)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1280)[M::mem_pestat] mean and std.dev: (414.99, 227.46)[M::mem_pestat] low and high boundaries for proper pairs: (1, 1632)[M::mem_pestat] analyzing insert size distribution for orientation RF...[M::mem_pestat] (25, 50, 75) percentile: (1176, 2230, 2643)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 5577)[M::mem_pestat] mean and std.dev: (1799.00, 921.05)[M::mem_pestat] low and high boundaries for proper pairs: (1, 7044)[M::mem_pestat] analyzing insert size distribution for orientation RR...[M::mem_pestat] (25, 50, 75) percentile: (365, 601, 1880)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4910)[M::mem_pestat] mean and std.dev: (1144.51, 1264.10)[M::mem_pestat] low and high boundaries for proper pairs: (1, 6425)[M::mem_pestat] skip orientation FF[M::mem_pestat] skip orientation RF[M::mem_pestat] skip orientation RR[M::mem_process_seqs] Processed 75458 reads in 2.356 CPU sec, 2.282 real sec[M::process] read 45746 sequences (6219545 bp)...[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (41, 35149, 14, 68)[M::mem_pestat] analyzing insert size distribution for orientation FF...[M::mem_pestat] (25, 50, 75) percentile: (321, 488, 1274)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3180)[M::mem_pestat] mean and std.dev: (607.09, 527.87)[M::mem_pestat] low and high boundaries for proper pairs: (1, 4133)[M::mem_pestat] analyzing insert size distribution for orientation FR...[M::mem_pestat] (25, 50, 75) percentile: (220, 353, 566)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1258)[M::mem_pestat] mean and std.dev: (407.78, 223.19)[M::mem_pestat] low and high boundaries for proper pairs: (1, 1604)[M::mem_pestat] analyzing insert size distribution for orientation RF...[M::mem_pestat] (25, 50, 75) percentile: (652, 2237, 2446)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 6034)[M::mem_pestat] mean and std.dev: (1704.23, 991.13)[M::mem_pestat] low and high boundaries for proper pairs: (1, 7828)[M::mem_pestat] analyzing insert size distribution for orientation RR...[M::mem_pestat] (25, 50, 75) percentile: (231, 409, 1018)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2592)[M::mem_pestat] mean and std.dev: (513.68, 468.59)[M::mem_pestat] low and high boundaries for proper pairs: (1, 3379)[M::mem_pestat] skip orientation FF[M::mem_pestat] skip orientation RF[M::mem_pestat] skip orientation RR[M::mem_process_seqs] Processed 74054 reads in 2.340 CPU sec, 2.266 real sec[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (40, 21701, 9, 32)[M::mem_pestat] analyzing insert size distribution for orientation FF...[M::mem_pestat] (25, 50, 75) percentile: (235, 720, 1382)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3676)[M::mem_pestat] mean and std.dev: (770.19, 676.46)[M::mem_pestat] low and high boundaries for proper pairs: (1, 4823)[M::mem_pestat] analyzing insert size distribution for orientation FR...[M::mem_pestat] (25, 50, 75) percentile: (219, 350, 560)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1242)[M::mem_pestat] mean and std.dev: (404.41, 221.74)[M::mem_pestat] low and high boundaries for proper pairs: (1, 1583)[M::mem_pestat] skip orientation RF as there are not enough pairs[M::mem_pestat] analyzing insert size distribution for orientation RR...[M::mem_pestat] (25, 50, 75) percentile: (325, 557, 721)[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1513)[M::mem_pestat] mean and std.dev: (501.66, 309.62)[M::mem_pestat] low and high boundaries for proper pairs: (1, 1909)[M::mem_pestat] skip orientation FF[M::mem_pestat] skip orientation RR[M::mem_process_seqs] Processed 45746 reads in 1.457 CPU sec, 1.421 real sec[main] Version: 0.7.18-r1243-dirty[main] CMD: bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq[main] Real time: 10.964 sec; CPU: 11.308 secMapping is complete.
$ bash run_variant_calling.sh \ results/bam/SRR2584866.aligned.sorted.bam \ data/ref_genome/ecoli_rel606.fasta \ results/vcf/SRR2584866 [mpileup] 1 samples in 1 input files[mpileup] maximum number of reads per input file set to -d 250Variant calling is complete.